1. INTRODUCTION

Hepatitis, an inflammation of the liver, is a condition that can be self-limiting or can progress to
fibrosis, cirrhosis, and liver cancer [1]. Hepatitis viruses are the most common cause of hepatitis worldwide,
but autoimmune diseases and toxic substances can also cause it [1].

There are five main types of hepatitis virus: A, B, C, D, and E [1], [2]. These types are of the
greatest concern because of the burden of illness and death they cause and their potential for outbreaks and
epidemic spread [1], [2]. In particular, types B and C lead to chronic disease in hundreds of millions of
people and are the most common cause of liver cirrhosis and cancer [1], [2].

Hepatitis A and hepatitis E are usually caused by ingestion of contaminated food or water, whereas
hepatitis B, hepatitis C, and hepatitis D result from parenteral contact with infected body fluids [3].
The common modes of transmission for these viruses are contaminated blood or blood products, medical
procedures performed with contaminated equipment, transmission from parent to child at birth (or from family
members to children), and sexual contact [3]. The infection may occur with limited or no symptoms, but it may
also include symptoms such as abdominal pain, dark urine, extreme fatigue, jaundice, nausea, or vomiting [3].

Because Indonesia is a large archipelago, the prevalence of viral infections varies greatly by
region. Among patients with acute hepatitis, 43% to 68% were infected with hepatitis A virus, 6% to 26%
with hepatitis B virus, and 15% to 37% with non-A, non-B hepatitis virus [3]. Antibodies to hepatitis A
virus (anti-HAV) were detected in 36% to 100% of 5-year-old children [4]. In the general population, the
prevalence of HBsAg has been estimated at 2.4% to 9.1%, and as high as 17% outside Java Island [4]. Among
patients with hepatocellular carcinoma and liver cirrhosis, 37% to 52% were HBsAg positive [4].

Hepatitis C virus (HCV) antibody was found in 0.5% to 3.4% of blood donors, 10% to 16% of acute
hepatitis patients, 21% to 41% of hepatocellular carcinoma patients, and 31% to 74% of liver cirrhosis
patients [4], [5]. In Indonesia, the two most important causes of chronic liver disease are HBV and HCV,
although 25% to 29% of hepatocellular carcinoma patients and 14% to 25% of liver cirrhosis patients had no
serologic evidence of HBV or HCV [4], [5].


2. RESEARCH METHOD 
2.1. Support vector machines 

A support vector machine (SVM) is a supervised machine learning model for two-group classification
problems [6]. After the model is given sets of labeled training data for each of the two categories, it can
categorize new examples [7]. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the dataset, where $x_i \in \mathbb{R}^n$ is
the feature vector, $y_i \in \{-1, +1\}$ is the class label for $x_i$, and $N$ is the number of samples
[7]-[11]. The main formula that support vector machines use to find the best hyperplane is:


$f(x) = w \cdot x + b$ (1)


In this formula, $w$ (weight) is the vector orthogonal to the hyperplane, which determines its
orientation, $b$ (bias) determines the distance of the hyperplane from the origin (which is $|b| / \|w\|$),
and $x$ denotes a training sample [12]. The aim is to maximize the margin [13]. To this end, SVM constructs
two bounding planes: $w^T x_i + b \geq +1$ for the positive class and $w^T x_i + b \leq -1$ for the negative
class. Figure 1 illustrates support vector machines [14].

Figure 1. Illustration of support vector machines


The optimization problem of SVM can be summarized as: 


Minimize,
$\frac{1}{2} \|w\|^2$ (2)
subject to, $y_i (w^T x_i + b) \geq 1, \; \forall i = 1, \ldots, N$ (3)
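
The step from "maximize the margin" to the objective in (2) follows from a standard geometric argument, sketched here. The points $x_+$ and $x_-$ are introduced only for this illustration: they are assumed to lie on the positive and negative bounding planes, respectively.

```latex
% x_+ and x_- are hypothetical points on the two bounding planes
w^T x_+ + b = +1, \qquad w^T x_- + b = -1
\;\Longrightarrow\; w^T (x_+ - x_-) = 2
\;\Longrightarrow\; \text{margin} = \frac{w^T (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}
```

Maximizing $2 / \|w\|$ is therefore equivalent to minimizing $\frac{1}{2} \|w\|^2$, which is the more convenient form because it is a smooth quadratic function of $w$.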


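By solving the problem above, the formulas for $w$ and $b$ are obtained in (4) and (5), where
$\alpha_i$ are the Lagrange multipliers, $S$ is the set of support vectors, and $N_S$ is the number of
support vectors:
$w = \sum_{i=1}^{N} \alpha_i y_i x_i$ (4)
$b = \frac{1}{N_S} \sum_{i \in S} \left( y_i - \sum_{m \in S} \alpha_m y_m x_m^T x_i \right)$ (5)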
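
The expressions for w and b in (4) and (5) can be checked by hand on a very small problem. The sketch below is illustrative only: the two training samples and the multiplier values `alpha` are assumptions chosen so that the dual problem has a simple solution, and they are not taken from the study's data. The sign of w · x + b then gives the predicted class.

```python
# Toy check of (4) and (5); the data and multipliers are illustrative assumptions.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Two training samples, one per class (hypothetical values).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
# Lagrange multipliers that solve the dual problem for this toy set.
alpha = [0.25, 0.25]

# (4): w is a weighted sum of the training vectors.
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# (5): average, over the support vectors, of y_i minus the weighted inner products.
support = [i for i, a in enumerate(alpha) if a > 0]
b = sum(
    y[i] - sum(alpha[m] * y[m] * dot(X[m], X[i]) for m in support)
    for i in support
) / len(support)

def predict(x):
    """Return the sign of w . x + b as the class label (+1 or -1)."""
    return 1 if dot(w, x) + b >= 0 else -1
```

For this toy set both samples are support vectors, and each lies exactly on its bounding plane, so the margin constraints hold with equality.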


The decision function of SVM can then be written as:
$f(x) = \operatorname{sign}(w \cdot x + b)$ (6)

In this study, kernel functions are used in support vector machines [15]. A kernel function allows
methods designed for linear problems to be applied to nonlinear problems, in particular algorithms that can
be expressed in terms of inner products between two vectors [16]. Several kernel functions and their
parameters are listed in Table 1 [15], [16].


Table 1. Several kernel functions


No.  Name                                  Kernel function
1.   Linear                                $K(x_i, x_j) = x_i^T x_j$
2.   Polynomial                            $K(x_i, x_j) = (c + x_i^T x_j)^d$
3.   Gaussian Radial Basis Function (RBF)  $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

2.2. Random forest 

Random forest (RF) is a flexible and easy-to-use machine learning algorithm [17]. It produces good
results most of the time, even without hyperparameter tuning [17], [18]. Because of its simplicity and
versatility, random forest is also one of the most widely used algorithms [18].

The random forest is a tree-based ensemble in which each decision tree depends on a collection of
random variables [19]. A decision tree is a flowchart-like structure [20]. For an n-dimensional problem, the
random vector $x = (x_1, x_2, \ldots, x_n)^T$ represents the predictor variables and a random variable $y$
represents the real-valued response [20]. Figure 2 illustrates a random forest [20].

Figure 2. Illustration of random forest


Most of the options depend on two data objects generated by random forests [21]. When the training
set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of
the sample. These out-of-bag (OOB) data are used to obtain a running unbiased estimate of the classification
error and to estimate variable importance [22], [23]. After each tree is built, all of the data are run down
the tree and proximities are computed for each pair of cases [23]. If two cases occupy the same terminal
node, their proximity is increased by one [23]. At the end of the run, the proximities are normalized by
dividing by the number of trees [24]. Proximities are used for locating outliers, replacing missing data, and
producing illuminating low-dimensional views of the data [24].
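
The two data objects described above can be illustrated with a short sketch. It is not a random forest implementation: no trees are grown, the terminal-node assignments in `example_leaves` are hypothetical, and the seed and sample size are arbitrary. It only shows how sampling with replacement leaves roughly one-third of the cases out of bag and how proximities are accumulated and normalized.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node reached by case i in tree t.

    Two cases gain one unit of proximity for every tree in which they share
    a terminal node; the totals are then divided by the number of trees.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    return [[value / n_trees for value in row] for row in prox]

rng = random.Random(0)  # fixed seed, chosen arbitrarily
n_cases = 1000          # hypothetical sample size
_, oob = bootstrap_sample(n_cases, rng)
oob_fraction = len(oob) / n_cases  # expected to be near 1/e

# Hypothetical terminal-node assignments for 3 cases in 2 trees.
example_leaves = [[0, 0, 1], [2, 3, 3]]
prox = proximity_matrix(example_leaves)
```

In the small example, cases 0 and 1 share a terminal node in one of the two trees, so their normalized proximity is one half, while cases 0 and 2 never share a node.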


2.3. Confusion matrix 
The confusion matrix is used to calculate accuracy. The formula for accuracy is given in (7) [25]:


$\text{accuracy} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N}$ (7)


$T_P$: Number of samples with hepatitis that were correctly classified.
$F_P$: Number of healthy people incorrectly classified as having hepatitis.
$F_N$: Number of samples with hepatitis incorrectly classified as healthy.
$T_N$: Number of healthy individuals correctly classified.
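
Formula (7) translates directly into code. The sketch below is a plain restatement of the formula; the function name and the counts in the example call are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from the four confusion-matrix counts, as in formula (7)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Hypothetical counts: 8 true positives, 10 true negatives,
# 1 false positive, 1 false negative.
example_accuracy = accuracy(tp=8, tn=10, fp=1, fn=1)
```

With these counts, 18 of the 20 samples are classified correctly.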
3. RESULTS AND ANALYSIS
3.1. Data 

The data used in this study come from hepatitis patients with inflammation of the liver. The
dataset contains 113 samples with 5 features, split into 90% training data and 10% testing data, and
consists of 84 majority-class samples and 29 minority-class samples. The minority class, labeled '1' in the
dataset, indicates the presence of inflammation, while the majority class, labeled '2', indicates no
inflammation. Table 2 describes the features of the inflammation data.
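
A train/test split of this size can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the shuffling, the seed, the rounding rule, and the ordering of the labels are all choices made here for the example.

```python
import random

def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]

# Label layout assumed for illustration: 29 samples of class 1, 84 of class 2.
labels = [1] * 29 + [2] * 84
train_idx, test_idx = split_indices(len(labels), 0.9)
minority_share = labels.count(1) / len(labels)
```

With 113 samples, the minority class makes up about a quarter of the data, so a 10% test set holds only a handful of minority cases. A stratified split, which preserves the class ratio in both parts, is a common way to reduce the resulting variance.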


Table 2. Features of the inflammation data


No.  Feature    Definition of feature
1.   Gender     Sex (male or female)
2.   SGOT       Serum glutamic oxaloacetic transaminase
3.   SGPT       Serum glutamic pyruvic transaminase
4.   Anti-HCV   Antibody to hepatitis C virus
5.   Diagnosis  The identification of the nature of an illness


3.2. Result 

This research used training data proportions ranging from 10% to 90%. Table 3 shows the accuracy of
each method and Table 4 compares the running time of each method. As listed in Table 3, the best average
accuracy was 99.55%, obtained by the SVM model with a Gaussian RBF kernel, followed by linear kernel SVM
(99.13%) and random forest (98.43%). The lowest average accuracy, 96.64%, came from the SVM model with a
polynomial kernel.

The Gaussian RBF kernel reaches 100% accuracy with 10%-40%, 60%, and 80% training data. The linear
kernel reaches 100% accuracy with 10%-20% and 40%-50% training data, and random forest does so with 10%-60%
training data. The polynomial kernel reaches 100% accuracy with 10%, 30%, and 60% training data.

In Table 4, the Gaussian RBF kernel gives the best performance, with an average running time of
2.3158 s, followed by polynomial kernel SVM with 2.3542 s and linear kernel SVM with 2.4578 s. Random forest
is the slowest, with an average running time of 7.31 s.


Table 3. The performance of each method


                                     Accuracy
Training data   SVM linear   SVM polynomial   SVM Gaussian RBF   Random forest
10%             1.0          1.0              1.0                1.0
20%             1.0          0.9565           1.0                1.0
30%             0.9705       1.0              1.0                1.0
40%             1.0          0.9782           1.0                1.0
50%             1.0          0.9824           0.9824             1.0
60%             0.9852       1.0              1.0                1.0
70%             0.9875       0.9875           0.9875             0.9875
80%             0.9890       0.9890           1.0                0.9890
90%             0.9901       0.8039           0.9901             0.8823
Average         0.9913       0.9664           0.9955             0.9843
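
The averages in Table 3 can be reproduced from the per-row values. The sketch below copies the accuracies from the table into a dictionary (the variable names are chosen here for illustration) and computes the mean per method.

```python
# Accuracy values copied from Table 3, one list per method, ordered from
# the smallest to the largest training proportion.
table3 = {
    "SVM linear": [1.0, 1.0, 0.9705, 1.0, 1.0, 0.9852, 0.9875, 0.9890, 0.9901],
    "SVM polynomial": [1.0, 0.9565, 1.0, 0.9782, 0.9824, 1.0, 0.9875, 0.9890, 0.8039],
    "SVM Gaussian RBF": [1.0, 1.0, 1.0, 1.0, 0.9824, 1.0, 0.9875, 1.0, 0.9901],
    "Random forest": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9875, 0.9890, 0.8823],
}

# Mean accuracy per method and the method with the highest mean.
averages = {name: sum(values) / len(values) for name, values in table3.items()}
best_method = max(averages, key=averages.get)
```

The computed means agree with the reported averages to within rounding in the fourth decimal place, and the ranking of the methods is unchanged.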


Table 4. The comparison of running time of each method


                                  Running time (s)
Training data   SVM linear   SVM polynomial   SVM Gaussian RBF   Random forest
10%             2.3331       2.3688           2.3570             7.31
20%             2.7447       2.3429           2.8153             7.31
30%             2.6945       2.8721           2.2034             7.31
40%             2.2015       2.2014           2.1982             7.31
50%             2.5598       2.3076           2.3521             7.31
60%             2.3193       2.1845           2.2880             7.31
70%             2.4280       2.1740           2.2993             7.31
80%             2.6149       2.3953           2.1403             7.31
90%             2.2248       2.3413           2.1886             7.31
Average         2.4578       2.3542           2.3158             7.31

All of the methods perform well in classifying the presence of liver inflammation leading to
hepatitis. The highest average accuracies, about 99%, come from the linear kernel and Gaussian RBF kernel
SVMs. Based on both accuracy and running time, however, the best method for classifying hepatitis is SVM
with a Gaussian RBF kernel.


4. CONCLUSION 

Predicting the presence of liver inflammation with machine learning can help medical staff classify
hepatitis. Early detection allows patients to receive the right treatment, which can prolong their lives and
reduce their risk. This study used four methods: SVM with linear, polynomial, and Gaussian RBF kernels, and
random forest. The experimental results show that both the SVM classifiers and the random forest method
predict the data accurately. However, considering both accuracy and running time, SVM with a Gaussian RBF
kernel is the best method for classifying the hepatitis data, as shown in Tables 3 and 4. In future
research, this method could be applied to a larger dataset and developed further to improve accuracy in
predicting or classifying other diseases.